Hunting of the Snark: Finding Data Glitches using Data Mining Methods

Authors

  • Tamraparni Dasu
  • Theodore Johnson
Abstract

Data quality is critical to data analysis because bad data can lead to incorrect conclusions. Problems with data are best detected early, before too much time and effort are spent ingesting and analyzing it. In this paper, we propose the use of data mining techniques for the automatic detection of data problems commonly encountered in large multivariate data sets. Data mining methods are ideal for this purpose, since they are designed for finding abnormal patterns in large volumes of data. We discuss some important types of data integrity issues. We demonstrate the use of a data mining method, the DataSphere set comparison technique (from our earlier work [6]), to detect glitches that mimic the error conditions discussed, using artificial data.

Similar resources

Hunting Data Glitches in Massive Time Series Data

In a previous paper [5] presented at IQ’99, we proposed a method for isolating data glitches in massive data sets using a data mining method called DataSpheres. The technique runs in linear time, isolating sections of data that contain corrupted or abnormal data. In this paper, we propose using the DataSphere technique to isolate problems in time series data. We define two types of multivar...

Using a Data Mining Tool and FP-Growth Algorithm Application for Extraction of the Rules in Two Different Datasets (TECHNICAL NOTE)

In this paper, we aim to improve association rules for use in recommender systems. Recommender systems provide a way to create personalized offers. One of the most important types of recommender system is collaborative filtering, which mines user information and offers users appropriate items. Among the data mining methods, finding frequent item sets and ...

Automated detection of coronavirus disease (COVID-19) by using data-mining techniques: a brief report

Background: The clinical field has vast amounts of patient data that have not been analyzed. Discovering a way to analyze this raw data and turn it into an information treasure can save many lives. Using data mining methods is an efficient way to analyze this large amount of raw data. It can predict the future with accurate knowledge of the past, providing new insights into disease diagnosis and prevention. S...

Prediction of Student Learning Styles using Data Mining Techniques

This paper focuses on the prediction of student learning styles using data mining techniques within their institutions. This prediction was aimed at finding out how different learning styles are achieved within learning environments that are specifically influenced by already existing factors. These learning styles have been affected by different factors that are mainly engraved and found wit...

Identification of Fraud in Banking Data and Financial Institutions Using Classification Algorithms

In recent years, due to the expansion of financial institutions, as well as the popularity of the World Wide Web and e-commerce, a significant increase in the volume of financial transactions has been observed. In addition to the increase in turnover, a huge increase in the number of frauds arising from abnormal user behavior is resulting in billions of dollars in losses around the world. T...

Publication date: 1999